Abstract 

A stronger relation exists between two nodes when they 
each point to one another (reciprocal edge) as compared 
to when only one points to the other (one-way edge). 
The proportion of reciprocal edges in a network is a good 
indicator of how tight the relations among the nodes are. 
The mutual relations could be an indication of friendship 
in a graph with social behavior or the information flow 
in a communication network. Despite their importance, 
reciprocal edges have been disregarded by most directed 
graph models. In our study, we propose a directed graph 
model that (i) combines the correct proportions of both 
reciprocal and one-way edges, (ii) matches the in-, out-, and 
reciprocal-degree distributions of the fitted graph, and (iii) 
requires only O(m) work for a graph with m edges, making 
it scalable to very large graphs. We show the effectiveness 
of the proposed model on several real-world graphs and 
compare it to other state-of-the-art models. 

1 Introduction 

Is the connection between two nodes one-way or two-way 
(reciprocal)? We can infer a lot from the answer to 
this question. Assume we are given an e-mail exchange 
network. If two people are sending messages to each 
other, there is a real connection between them. How- 
ever, if the message is only being sent in one direction, 
we cannot infer whether these two people know each 
other or not (e.g., the sender could be a spammer). A 
high proportion of reciprocal edges is also an indicator 
of social behaviour in a network. 

To study the impact of reciprocal edges, we categorize 
the directed edges into two types: reciprocal and 
one-way (see Figure 1). Formally, we say an edge (u,v) 
is reciprocal if the corresponding edge (v,u) also exists; 
otherwise, we say it is one-way. 

Figure 1: A directed graph with reciprocal (e.g., B-D) 
and one-way (e.g., D-A) edges. 

The reciprocity, r, measures the density of recip- 
rocal edges in a network [15] where r is the ratio of 
the number of reciprocal edges to the total number of 
edges. It can be interpreted as the probability of a ran- 
dom edge in a network being reciprocated. The reci- 
procity ratio is higher in the networks with social flavor 
(e.g., twitter, flickr, e-mail) whereas it is lower in infor- 
mation networks (e.g., web, news forums) or temporal 
networks (e.g., citation); see Table 1. It was observed 
that if a network has more reciprocal edges, viruses or 
news spread more quickly [15]. The reciprocity mea- 
sure defined in [15] cannot distinguish networks with 
different densities (dense networks tend to have higher 
r values) and it can be exaggerated by self-links. The 
reciprocity measure can be improved by, for example, 
removing self-links and normalizing using the density 
[5]. 
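
As a minimal sketch of the reciprocity ratio described above, the following Python function counts the fraction of directed edges whose reverse edge is also present. The function name `reciprocity` and the toy edge list are illustrative assumptions; self-links are dropped, as the text suggests, but no density normalization is attempted.

```python
def reciprocity(edges):
    """Fraction of directed edges (u, v) whose reverse (v, u) also exists."""
    # Deduplicate and drop self-links, which would otherwise inflate r.
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    reciprocated = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return reciprocated / len(edge_set)


# Hypothetical toy graph: B<->D is reciprocal, the other edges are one-way.
toy_edges = [("B", "D"), ("D", "B"), ("D", "A"), ("A", "C")]
print(reciprocity(toy_edges))  # 2 of the 4 directed edges are reciprocated -> 0.5
```

Note that each reciprocal pair contributes two directed edges to the numerator, which matches the interpretation of r as the probability that a random directed edge is reciprocated.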

In another interesting study [12], the formation 
order of the reciprocal edges is analyzed in some social 
interaction based networks such as Flickr (which has 
68% reciprocal edges), and it is found that 83% of all 
reciprocal edges are created within 48 hours after the 
initial edge creation. Twitter has also been analyzed in 
terms of reciprocity: 22.1% of its edges are reciprocal [7], 
which suggests that Twitter is a mixture of a social 
network and an information-access network. 

Our interest is in realistic generative models that 
capture, at a minimum, the degree distributions of real- 
world networks. It is important to study networks 
to understand formation behaviours, detect abnormalities, 
and increase robustness. However, real networks 
may not be released due to privacy and security issues. 
Therefore, random network generators reproducing 
salient features of real networks are required. 

One of the most common salient features among 
networks is that the degree distributions of many real 
networks are heavy-tailed [2]. Existing directed graph 
models such as Stochastic Kronecker Graphs (SKG) 
[9, 8] and Forest Fire (FF) [10] only roughly match the 
given in-degree and out-degree distributions. Moreover, 
these and many other directed graph models generate 
few or no reciprocal edges. 

Another concern in random graph generation is 
scalability. As networks grow ever larger, generating 
random graphs in a scalable manner becomes crucial. 
Many graph generators add edges incrementally, creating 
new edges based on previous insertions, which makes 
them inherently serial. Another drawback is that finding 
the right parameters of the models (e.g., SKG, FF) is 
very time-intensive for large-scale graphs. We need to 
generate realistic directed graphs with reciprocal edges 
in a scalable manner, without costly parameter 
estimation steps. 

1.1 Contributions 

• Reciprocal edges represent social exchanges in the 
networks; however, most generative models specif- 
ically designed for social networks are missing this 
crucial behavior. We propose the Fast Reciprocal 
Directed (FRD) graph generator which explicitly 
matches the reciprocal degree distribution as well 
as the in- and out- degree distributions. 

• A critical component of the FRD is a fast model 
for generating a directed graph without reciprocal 
edges. For that purpose, we propose the Fast 
Directed (FD) graph generator, which is a close 
cousin of the Chung-Lu [3] and edge configuration 
models [14, 1]. 

• Neither FRD nor FD requires any "model fitting" 
parameters beyond the target degree distributions. 

• Both of our proposed models are fast, generating 
m edges in O(m) time for a constant maximum 
degree. Our models take less than a minute to 
generate a graph with millions of nodes and 
edges (see Table 2), faster than the models we 
compare against (SKG and FF). 

• We also explain why the number of degree-1 nodes 
is much lower than intended in Chung-Lu like 
models [3, 17] and propose a solution to obtain a 
better match for the degree-1 vertices. 

2 Related Work 

In this section, we consider existing directed generative 
graph models. Most previous models suffer from 
some combination of the following problems: few or no 
reciprocal edges, an inability to match the various degree 
distributions precisely, and a lack of scalability in fitting 
and/or generation (most models require some "history" 
to pick the next set of edges). 

Kleinberg et al. [6] propose the Edge Copying (EC) 
model for web networks, based on the observation that 
web networks have topic-based clusters. In the EC model, 
when a new node arrives, it selects a random vertex v 
and copies a specified number of links k of vertex v 
[6]. The EC model has no mechanism for reciprocal 
edges since new nodes always point to older nodes. 
Leskovec et al. propose the Forest Fire (FF) model [10], 
in which a new node can connect both to the vertices 
pointing to vertex v and to the vertices that vertex 
v points to. The edge creation continues with the 
neighbors of the neighbors of v; in other words, the fire 
spreads through a region. The FF model has forward ($p_f$) 
and backward ($p_b$) burning probabilities, which 
specify the density (seriousness) of the fire region. 
FF is a state-of-the-art model and will be used in our 
comparative studies. Like EC, the FF model cannot create 
reverse (reciprocal) links between two nodes. Also, 
both methods are serial in nature because each new set 
of links for a new vertex depends on the graph that 
has been created thus far. To fit FF to a real graph, 
the forward $p_f$ and backward $p_b$ burning parameters 
must be adjusted to match the number of edges 
in the fitted graph. In large-scale networks, each 
fitting attempt with different burning parameters is very 
expensive. In practice, the time required for FF 
fitting is (number of attempts to find the parameters) 
× (FF graph generation time), but we only report the 
generation time (after fitting is complete) in our studies. 

Unlike the EC model and FF model, the Stochastic 
Kronecker Graph (SKG) model [9, 8] is a scalable model. 
It is also considered to be state-of-the-art and is used 
for comparison in our experiments. The SKG model 
begins with an initiator matrix (typically 2×2) and 
produces larger graphs by recursive Kronecker products, 
much faster than incremental methods. For large-scale 
graphs, computing the initiator matrix is very 
time-intensive [8]. In fact, we were unable to compute the 
initiator matrix for the larger graphs in a reasonable 
time frame; therefore, for the remainder, we use the 
initiator matrices that have been previously reported 
in the literature [8, 19]. As with FF, we only report the 
generation time for the graphs. 

In another directed model, the growth of Wikipedia 
is imitated and reciprocal edges are partly supported 
[21, 20]. This model extends the well-known 
preferential attachment (PA) model [2] by including the 
reciprocity measure r. Nodes arrive one at a time; each 
new node connects to k sink vertices, and then reciprocal 
links are formed from those sink vertices back to the 
newly added node with probability r. The authors test 
their model only on the Wikipedia network and show that 
it matches the in-degree distribution well; however, it 
deviates sharply from the real out-degree distribution. 
Like EC and FF, this model is not scalable. 

The work of this paper is closely related to the 
Chung-Lu (CL) model [3], which generates an undirected 
graph whose degree distribution matches a 
given degree distribution. The CL model creates an 
edge between $v_i$ and $v_j$ with probability proportional 
to the product of their degrees. In the CL model, each 
edge is created by an independent coin flip for every pair 
of vertices, and therefore it does not scale well. A "fast" 
CL model that behaves like the SKG model in terms of 
several network measures was introduced in [16]. Another 
model, the Edge Configuration (EdgeCon) model [14, 1], 
creates a set containing $d_i$ copies of each vertex $v_i$ 
and then chooses a random matching of the elements in 
the set to create edges. 

3 Proposed Directed Graph Models 

In this study, we propose two scalable directed graph 
models: the Fast Directed (FD) model which generates 
a directed graph G = (V, E) with respect to the given 
in- and out-degree distributions, and the Fast Reciprocal 
Directed (FRD) which explicitly accounts for reciprocal 
edges as well. 

Before going into the details, we present the nota- 
tion. Given a directed graph G, let n be the number 
of nodes and m be the number of directed edges. For 
instance, in Figure 1, n = 5 and m = 7. We divide the 
edges at each node i into three types: 

• $d_i^{\leftrightarrow}$ = reciprocal degree (each reciprocal edge 
corresponds to a pair of directed edges), 

• $d_i^{\leftarrow}$ = in-degree (excluding reciprocal edges), and 

• $d_i^{\rightarrow}$ = out-degree (excluding reciprocal edges). 

We also define the total in- and out-degrees, which 
include the reciprocal edges, i.e., 

• $d_i^{\Leftarrow} = d_i^{\leftrightarrow} + d_i^{\leftarrow}$ = total in-degree, and 

• $d_i^{\Rightarrow} = d_i^{\leftrightarrow} + d_i^{\rightarrow}$ = total out-degree. 

Most directed graph models consider only the total in- 
and out-degrees, ignoring reciprocity. As an example of 
these measures, node B in Figure 1 has d"^ = 2, d"^ = 2, 
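
We may also assemble the corresponding degree 
distributions, as follows. For any d = 0, 1, ..., define 

• $n_d^{\leftrightarrow}$ = Number of nodes with reciprocal-degree d, 

• $n_d^{\leftarrow}$ = Number of nodes with in-degree d, 

• $n_d^{\rightarrow}$ = Number of nodes with out-degree d, 

• $n_d^{\Leftarrow}$ = Number of nodes with total-in-degree d, and 

• $n_d^{\Rightarrow}$ = Number of nodes with total-out-degree d. 

Let $d_{\max}$ be the maximum of all possible degrees. Then 
we can express n and m as 

$$ n = \sum_{d=0}^{d_{\max}} n_d^{\leftrightarrow} = \sum_{d=0}^{d_{\max}} n_d^{\leftarrow} = \sum_{d=0}^{d_{\max}} n_d^{\rightarrow}, \qquad m = \sum_{d=1}^{d_{\max}} d \cdot n_d^{\Leftarrow} = \sum_{d=1}^{d_{\max}} d \cdot n_d^{\Rightarrow}. $$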

The reciprocity ratio of a graph [15] is 

$$ r = \frac{\#\ \text{reciprocated edges}}{\#\ \text{edges}} = \frac{\sum_{d=1}^{d_{\max}} d \cdot n_d^{\leftrightarrow}}{m}. $$
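
The notation above can be made concrete with a small sketch. The Python code below splits each node's edges into reciprocal, in-only, and out-only degrees, then checks that the total in-degrees and total out-degrees each sum to m and that the reciprocity ratio equals the sum of reciprocal degrees divided by m. The helper name `degree_decomposition` and the toy edge list are assumptions made for illustration.

```python
from collections import Counter


def degree_decomposition(edges):
    """Return (reciprocal, in-only, out-only) degree counters and m."""
    edge_set = {(u, v) for u, v in edges if u != v}  # simple graph, no self-links
    recip, d_in, d_out = Counter(), Counter(), Counter()
    for u, v in edge_set:
        if (v, u) in edge_set:
            recip[u] += 1  # the reverse edge (v, u) adds the matching count for v
        else:
            d_out[u] += 1
            d_in[v] += 1
    return recip, d_in, d_out, len(edge_set)


# Hypothetical toy graph with one reciprocal pair (1 <-> 2).
edges = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 2)]
recip, d_in, d_out, m = degree_decomposition(edges)
nodes = {v for e in edges for v in e}

total_in = {v: recip[v] + d_in[v] for v in nodes}
total_out = {v: recip[v] + d_out[v] for v in nodes}
n_recip = Counter(recip[v] for v in nodes)  # reciprocal-degree distribution

r = sum(d * count for d, count in n_recip.items()) / m
print(m, sum(total_in.values()), sum(total_out.values()), r)  # 5 5 5 0.4
```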



3.1 The Fast Directed Graph Model. In this 
model, we consider only the total in- and out-degrees, 
ignoring reciprocity. 

To generate the Fast Directed (FD) model, we 
extend the Fast Chung-Lu (FCL) model for undirected 
graphs [17]. This model is based on the idea that each 
edge creation can be done independently if the degree 
distribution is given. The FCL reduces the complexity 
of the CL model from $O(n^2)$ to $O(m)$, and the same can 
be done in the directed case. 

In the Chung-Lu model [3], after m insertions (and 
assuming $d_i^{\Rightarrow} d_j^{\Leftarrow} < m$ for all i, j) the probability of 
edge (i, j) is 

$$ p_{ij} = \frac{d_i^{\Rightarrow} d_j^{\Leftarrow}}{m}. $$



The naive approach flips a coin for each possible edge 
independently. The "fast" approach instead makes a random 
draw to pick each endpoint of each of the m edges. The 
probability of picking node i as the source is proportional 
to $d_i^{\Rightarrow}$ and the probability of picking node j as the 
destination is proportional to $d_j^{\Leftarrow}$. 
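
The "fast" endpoint-picking idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it samples directly from hypothetical per-node target degrees (`out_deg`, `in_deg`) with `random.choices`, rather than using the pooled selection of Alg. 2, and the function name `fast_directed_cl` is invented for the example.

```python
import random


def fast_directed_cl(out_deg, in_deg, seed=0):
    """Draw m (source, sink) pairs; sources ~ out-degree, sinks ~ in-degree."""
    rng = random.Random(seed)
    m = sum(out_deg.values())
    if m != sum(in_deg.values()):
        raise ValueError("total out-degree and total in-degree must both equal m")
    nodes_out, w_out = zip(*out_deg.items())
    nodes_in, w_in = zip(*in_deg.items())
    sources = rng.choices(nodes_out, weights=w_out, k=m)
    sinks = rng.choices(nodes_in, weights=w_in, k=m)
    # A set drops duplicate edges; the filter drops self-links.
    return {(i, j) for i, j in zip(sources, sinks) if i != j}


# Hypothetical target degrees for four nodes (m = 6).
edges = fast_directed_cl({0: 3, 1: 1, 2: 0, 3: 2}, {0: 1, 1: 2, 2: 3, 3: 0})
print(sorted(edges))
```

Only 2m random draws are needed, in contrast to one coin flip per ordered pair of nodes in the naive approach.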

Our implementation works as described in Alg. 1. 
We first pick all the source nodes and then all the sink 
nodes using the weighted vertex selection described in 
Alg. 2. If we want 500 nodes of out-degree 2, for 
example, we create a "degree-2 pool" of 500 vertices 
and pick from it a total of 1000 times in expectation by 
doing weighted sampling of the pools. Within the pool, 
we pick a vertex uniformly at random, so each vertex 
in the pool is picked 2 times on average. In Alg. 2, the 
pool of degree-d vertices is denoted by $V_d$ and the 
likelihood that the dth pool is selected is denoted by $w_d$. 
In all cases except d = 1, the size of the pool is the number 
of vertices of that degree, and the weight of the pool is 
proportional to the number of edges that should be in that 
pool. The one exception is the degree-1 pool, whose size is 
multiplied by a blowup factor b. For now, assume b = 1; 
we explain its importance further on in §3.3. At the end of 
Alg. 2, we randomly relabel the vertices so there is no 
correlation between the degree and vertex identifier. 



The FD method can produce repeat edges, unlike 
the naive version that flips weighted coins (one per 
edge). Nevertheless, this has not been a major problem 
in our experience. An alternative to Alg. 2 is to put 
d copies of each degree-d vertex into a long array and 
then randomly permute it; this is the approach of the 
edge configuration model. This gives the exact specified 
degree distribution (excepting possible repeats) by using 
a random permutation of a length-m array. This would 
produce very similar results to what we show here, and 
is certainly a viable alternative. We also mention an 
alternate way of generating Chung-Lu graphs that could 
be adapted for the directed case [11]. 
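
The edge-configuration alternative mentioned above can be sketched as follows. This is a hedged illustration for the directed case, assuming one length-m array of source copies and one of sink copies; the function name `configuration_edges` and the per-node degree dictionaries are hypothetical.

```python
import random
from collections import Counter


def configuration_edges(out_deg, in_deg, seed=0):
    """Pair d copies of each vertex via a random permutation of the sink array."""
    rng = random.Random(seed)
    sources = [v for v, d in out_deg.items() for _ in range(d)]
    sinks = [v for v, d in in_deg.items() for _ in range(d)]
    if len(sources) != len(sinks):
        raise ValueError("out- and in-degree sequences must have the same sum")
    rng.shuffle(sinks)  # random matching of source copies to sink copies
    return list(zip(sources, sinks))


# Hypothetical degree sequences with m = 3.
pairs = configuration_edges({0: 2, 1: 1, 2: 0}, {0: 0, 1: 1, 2: 2})
print(Counter(i for i, _ in pairs))  # out-degrees are matched exactly
```

Before self-links and repeated edges are removed, every vertex appears exactly as often as its target degree, which is the sense in which this approach reproduces the degree distribution exactly.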



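Algorithm 1 Fast Directed Graph Model 

procedure FDModel($G, b^{\Rightarrow}, b^{\Leftarrow}$) 
    Calculate $\{n_d^{\Rightarrow}\}$ and $\{n_d^{\Leftarrow}\}$ for $G$ 
    $\{i_k\} \leftarrow$ VertexSelect($\{n_d^{\Rightarrow}\}, b^{\Rightarrow}$) 
    $\{j_k\} \leftarrow$ VertexSelect($\{n_d^{\Leftarrow}\}, b^{\Leftarrow}$) 
    $E \leftarrow \{(i_k, j_k)\}$ 
    Remove self-links and duplicates from $E$ 
    return $E$ 
end procedure 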



Algorithm 2 Weighted Vertex Selection 

procedure VertexSelect($\{n_d\}, b$) 
    $V \leftarrow \{1, \ldots, n\}$ 
    for all $d = 1, \ldots, d_{\max}$ do 
        $w_d \leftarrow d \cdot n_d / m$ 
        if $d > 1$ then 
            $V_d \leftarrow n_d$ vertices from $V$ 
        else 
            $V_1 \leftarrow b \cdot n_1$ vertices from $V$ 
        end if 
        $V \leftarrow V \setminus V_d$ 
    end for 
    for all $k = 1, \ldots, m$ do 
        $d_k \leftarrow$ Random degree in $\{1, \ldots, d_{\max}\}$, proportional to weights $\{w_d\}$ 
        $i_k \leftarrow$ Uniform random vertex in $V_{d_k}$ 
    end for 
    $V \leftarrow$ unique indices in $\{i_k\}$ 
    $\pi \leftarrow$ Random mapping from $V$ to $\{1, \ldots, n\}$ 
    return $\{\pi(i_k)\}$ 
end procedure 
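
The pooled selection and the FD model can be sketched in Python as below. This is a rough illustration rather than the MATLAB implementation used in the experiments: the names `vertex_select` and `fd_model` are invented, the blowup factor `b` is assumed to be an integer, temporary ids are allocated as consecutive blocks large enough to hold every pool, and the final relabeling draws from a range of at least `n` labels (an assumption made so that the sketch also runs when the blowup creates more distinct picks than `n`).

```python
import random


def vertex_select(n_d, n, b=1, rng=random):
    """Pick edge endpoints from degree pools; n_d maps degree d -> node count."""
    pools, weights, next_id = {}, {}, 0
    for d, count in sorted(n_d.items()):
        if d < 1 or count == 0:
            continue  # degree-0 nodes are never endpoints
        size = b * count if d == 1 else count  # blowup applies to the degree-1 pool only
        pools[d] = range(next_id, next_id + size)  # disjoint block of temporary ids
        weights[d] = d * count  # number of endpoints this pool should receive
        next_id += size
    m = sum(weights.values())
    if m == 0:
        return []
    degrees = list(pools)
    chosen = rng.choices(degrees, weights=[weights[d] for d in degrees], k=m)
    picks = [rng.choice(pools[d]) for d in chosen]  # uniform within the chosen pool
    # Random relabeling, so a vertex id carries no information about its degree.
    unique = sorted(set(picks))
    labels = rng.sample(range(max(n, len(unique))), len(unique))
    relabel = dict(zip(unique, labels))
    return [relabel[i] for i in picks]


def fd_model(n_out, n_in, n, b_out=1, b_in=1, seed=0):
    """Directed edges from total out- and in-degree distributions."""
    rng = random.Random(seed)
    sources = vertex_select(n_out, n, b_out, rng)
    sinks = vertex_select(n_in, n, b_in, rng)
    # The set removes duplicate edges; the filter removes self-links.
    return {(i, j) for i, j in zip(sources, sinks) if i != j}


# Hypothetical distributions: both describe m = 10 edge endpoints.
edges = fd_model({1: 4, 2: 3}, {1: 6, 2: 2}, n=8)
print(len(edges))
```

Sources and sinks are relabeled independently in this sketch, so a vertex's out-degree and in-degree are uncorrelated.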



3.2 The Fast Reciprocal Directed Graph Model 

The FD model generates a directed graph that matches 
the total in- and out-degree distributions. However, it 
produces virtually no reciprocal edges. To overcome this 
issue, we propose the Fast Reciprocal Directed (FRD) 
graph model. 

Here, the goal is quite simple: capture the reciprocal 
edges using an undirected model and the remaining 
directed edges using a directed model. After that, 
we merge the edges generated by the two models into a 
single graph. In this case, we explicitly consider the three 
distributions $\{n_d^{\leftrightarrow}\}$, $\{n_d^{\leftarrow}\}$, and $\{n_d^{\rightarrow}\}$. The method 
is presented in Alg. 3. 



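Algorithm 3 Fast Reciprocal Directed Graph Model 

procedure FRDModel($G, b^{\leftrightarrow}, b^{\leftarrow}, b^{\rightarrow}$) 
    Calculate $\{n_d^{\leftrightarrow}\}$, $\{n_d^{\leftarrow}\}$, and $\{n_d^{\rightarrow}\}$ for $G$ 
    $\{i_k\} \leftarrow$ VertexSelect($\{n_d^{\leftrightarrow}\}, b^{\leftrightarrow}$) 
    $\{j_k\} \leftarrow$ VertexSelect($\{n_d^{\leftrightarrow}\}, b^{\leftrightarrow}$) 
    $E_1 \leftarrow \{(i_k, j_k), (j_k, i_k)\}$ 
    $\{i_l\} \leftarrow$ VertexSelect($\{n_d^{\rightarrow}\}, b^{\rightarrow}$) 
    $\{j_l\} \leftarrow$ VertexSelect($\{n_d^{\leftarrow}\}, b^{\leftarrow}$) 
    $E_2 \leftarrow \{(i_l, j_l)\}$ 
    $E \leftarrow E_1 \cup E_2$ 
    Remove self-links and duplicates from $E$ 
    return $E$ 
end procedure 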
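
The blending step can be illustrated with a simplified Python sketch. To keep it short, it samples from hypothetical per-node target degrees instead of the degree pools of Alg. 2, and it omits the blowup factor and the random relabeling. Drawing half as many reciprocal pairs as the sum of reciprocal degrees is also an assumption of this sketch, chosen because each pair contributes two directed edges. The names `frd_model` and `_pick` are invented.

```python
import random


def _pick(deg, k, rng):
    """Draw k nodes with probability proportional to their target degree."""
    nodes = [v for v, d in deg.items() if d > 0]
    if k == 0 or not nodes:
        return []
    return rng.choices(nodes, weights=[deg[v] for v in nodes], k=k)


def frd_model(recip, d_in, d_out, seed=0):
    """Union of an undirected (reciprocal) part and a directed (one-way) part."""
    rng = random.Random(seed)
    # Reciprocal part: each sampled pair yields both (i, j) and (j, i).
    k_rec = sum(recip.values()) // 2
    e1 = set()
    for i, j in zip(_pick(recip, k_rec, rng), _pick(recip, k_rec, rng)):
        if i != j:
            e1.update({(i, j), (j, i)})
    # One-way part: sources follow out-degree, sinks follow in-degree.
    k_dir = sum(d_out.values())
    e2 = {(i, j)
          for i, j in zip(_pick(d_out, k_dir, rng), _pick(d_in, k_dir, rng))
          if i != j}
    return e1 | e2


# Hypothetical per-node targets.
recip = {0: 2, 1: 1, 2: 1}
edges = frd_model(recip, d_in={1: 2, 2: 1}, d_out={0: 1, 3: 2})
print(sorted(edges))
```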



3.3 Fixing the Number of Degree-1 Nodes. Below, 
we present our arguments for the case of the in-degree, 
but the same arguments apply to the out-degree 
or the reciprocal degree (with slightly more complexity in 
the reciprocal case, which is omitted due to space). For 
simplicity, we use just the notation d to denote the 
in-degree. 

If we run VertexSelect (Alg. 2) repeatedly, always 
assigning the same ids to each vertex pool and 
omitting the random relabeling ($\pi$) at the end, each 
node will get its desired in-degree on average across 
multiple runs. For any single run, however, this will not be 
the case. In fact, the degrees are Poisson distributed. 

Claim 3.1. The probability that a vertex v in pool $V_d$ 
is selected x times is 

$$ \text{Prob}\{ v \text{ selected } x \text{ times} \mid v \in V_d \} = \frac{d^x e^{-d}}{x!}. $$

This claim is easy to see. We expect that pool $V_d$ 
will be selected $m \cdot w_d = d \cdot n_d$ times. Therefore, each 
of the $n_d$ elements of $V_d$ will be selected an average of 
d times, so that is the Poisson parameter. (There may be 
some small variance in the number of times that each pool 
is selected, but the variance should be small enough not 
to greatly impact the average degree.) 

The effect of the Poisson distribution is particularly 
noticeable in the pool of degree-1 nodes, where the 
probability that a node in $V_1$ has in-degree x = 1 is 
only about 37% ($e^{-1} \approx 0.368$). An additional 37% will 
have an in-degree of x = 0, and the remaining 26% will 
have an in-degree of x ≥ 2. Of course, there will be some 
contributions from the other pools, e.g., 27% of the nodes 
in $V_2$ will become degree-1 nodes. However, in a power 
law degree distribution, $n_2 \ll n_1$, so its contribution is 
small. Nevertheless, we can calculate the expected number 
of degree-x nodes by summing over the contributions 
across all degree pools. 
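
The Poisson probabilities behind this argument can be checked numerically. The short sketch below evaluates the Poisson probability mass function of Claim 3.1 for the degree-1 and degree-2 pools; the helper name `poisson_pmf` is an assumption.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Probability that a Poisson(lam) variable equals x."""
    return lam**x * exp(-lam) / factorial(x)


p0 = poisson_pmf(0, 1)  # node in the degree-1 pool is never selected
p1 = poisson_pmf(1, 1)  # node in the degree-1 pool is selected exactly once
print(round(p0, 3), round(p1, 3), round(1 - p0 - p1, 3))  # 0.368 0.368 0.264
print(round(poisson_pmf(1, 2), 3))  # degree-2 pool node selected once: 0.271
```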

Claim 3.2. Let $n'_x$ denote the number of nodes that 
are selected exactly x times. Then 

$$ E(n'_x) = \sum_{d=1}^{d_{\max}} n_d \cdot \frac{d^x e^{-d}}{x!}. $$

Again, the claim is easy to see and so the proof is 
omitted. 

For many real-world distributions, $n'_1 \ll n_1$. We 
propose a workaround to this problem: we would like 
to reduce the number of nodes in $V_1$ that are selected 
multiple times. To do this, we increase the size of the 
pool via a blowup factor b, which is used as follows. Let 
$V_1$ contain $b \cdot n_1$ nodes. The weight of the pool will not 
change, meaning that it will still be selected $n_1$ times. 
Therefore, we may make the following claim. 

Claim 3.3. The probability that a vertex v in pool $V_1$ 
with $b \cdot n_1$ elements is selected x times is 

$$ \text{Prob}\{ v \text{ selected } x \text{ times} \mid v \in V_1 \} = \frac{e^{-1/b}}{b^x \cdot x!}. $$

Furthermore, the expected number of nodes in $V_1$ that 
are selected exactly one time is $n_1 \cdot e^{-1/b}$. Hence, letting 
$n'_x$ denote the number of nodes that are selected exactly 
x times, we have 

$$ E(n'_x) = n_1 \cdot \frac{e^{-1/b}}{b^{x-1} \cdot x!} + \sum_{d>1} n_d \cdot \frac{d^x e^{-d}}{x!}. $$



Proof. We still pick pool $V_1$ a total of $n_1$ times, so the 
average (i.e., the Poisson parameter) for this pool is 
now reduced to $n_1/(n_1 \cdot b) = 1/b$ since there are $b \cdot n_1$ 
elements. 

The next equation comes from the fact that there 
are $b \cdot n_1$ nodes in the pool, so we multiply the number 
of nodes by the probability of being picked x times 
with x = 1 to determine the expected number. 

Finally, the revised expectation comes from changing 
the formula for the first pool to account for the 
enlarged pool size. 
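
The effect of the blowup factor on the expected counts can be explored with a small sketch of the expectation in Claim 3.3. The function name `expected_selected` and the example degree distribution are hypothetical; setting b = 1 recovers the expectation without blowup.

```python
from math import exp, factorial


def expected_selected(n_d, x, b=1):
    """Expected number of nodes selected exactly x times, with blowup b on pool 1."""
    total = 0.0
    for d, count in n_d.items():
        if d == 1:
            lam, size = 1.0 / b, b * count  # enlarged pool, smaller Poisson parameter
        else:
            lam, size = float(d), count
        total += size * lam**x * exp(-lam) / factorial(x)
    return total


# Hypothetical heavy-tailed distribution: degree -> number of nodes.
n_d = {1: 1000, 2: 100, 3: 10}
for b in (1, 10):
    print(b, round(expected_selected(n_d, 1, b), 1))
```

For this example the expected number of degree-1 nodes rises from roughly 40% of the 1000 requested (b = 1) to over 90% (b = 10).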



If we choose, for example, b = 10, then we can 
expect about $0.9 \cdot n_1$ nodes in $V_1$ to be selected exactly 
one time. We show an example of the impact of this 
modification in Figure 2, where we show the total 
in-degree for soc-Epinions1 with and without a blowup 
factor of b = 10. The degrees are logarithmically binned 
and summed. Note that the match for the number of 
degree-1 nodes is improved, but there is a small penalty 
in the match for degree-2 nodes. We use b = 10 in all 
experiments reported in this paper. 

Figure 2: Example of in-degree distribution with and 
without the blowup factor. Note that the model with 
the blowup factor matches degree-1 nodes precisely, 
whereas the model without blowup generates only half 
of the degree-1 nodes in the original graph. 



4 Experimental Studies 

We test our models on various directed networks: 
citation (cit-HepPh), web (web-NotreDame), and social 
(soc-Epinions1, soc-LiveJournal) [22]. We also test our 
models on large-scale graphs coming from online social 
networks (youtube, flickr, LiveJournal) [13]. We list the 
attributes of the networks in Table 1 after removing 
self-links and making the graph unweighted (simple). Note 
that reciprocity, r, is very low in the citation network. 
We elaborate how we fit the models to the real networks 
below. 

Fast Directed (FD) and Fast Reciprocal 
Directed (FRD). Our proposed models work directly 
with the appropriate degree distributions of the input 
graphs. We used a blowup factor of b = 10 in all cases. 

Forest Fire (FF). We provide the number of nodes 
n and the forward and backward burning probabilities 
$p_f$ and $p_b$ to the SNAP software [22]. To fit FF, we 
match the number of edges in the generated graphs to the 
number of edges in the real networks. For each target 
graph, we search a range of values, incrementing $p_f$ 
in steps of 0.001 over the range [0.2, 0.5], to find the 
model whose number of edges is closest to that of the 
original network; the values we use are reported in 
Table 1. We set $p_b$ = 0.32 as described in [10]. 

Table 1: Networks used in this study. The value of r is the reciprocity measure, $p_f$ is the forward burning 
parameter for FF, and the last column is the SKG initiator matrix. 

Graph Name            Nodes    Edges     Rec. Edges  r      p_f    SKG initiator
cit-HepPh [22]        34K      421K      <1K         0.003  0.37   [0.990,0.440;0.347,0.538] [8]
soc-Epinions1 [22]    76K      508K      206K        0.405  0.346  [0.999,0.532;0.480,0.129] [8]
web-NotreDame [22]    325K     1,469K    759K        0.517  0.355  [0.999,0.414;0.453,0.229] [8]
soc-LiveJournal [22]  4,847K   68,475K   32,434K     0.632  0.358  [0.896,0.597;0.597,0.099] [19]
youtube [13]          1,157K   4,945K    3,909K      0.791  0.335  -
flickr [13]           1,861K   22,613K   14,117K     0.624  0.355  -
LiveJournal [13]      5,284K   77,402K   56,920K     0.735  0.355  -

Stochastic Kronecker Graphs (SKG). We use 
the initiator matrices reported by previous studies: [8] 
for cit-HepPh, soc-Epinions1, and web-NotreDame and 
[19] for soc-LiveJournal. We attempted to generate 
initiator matrices for the large graphs using [22], but the 
program did not terminate within twenty-four hours. 
Therefore, we only fit SKG to the networks obtained 
from the SNAP [22] data warehouse. We set the size of 
the final adjacency matrix to $2^{\lceil \log_2(n) \rceil}$, where n is the 
number of nodes in the real graph. 

We generate all the models on a Linux machine 
with 12GB memory and an Intel Xeon 2.7 GHz processor. 
The FD and FRD methods were implemented by us 
in MATLAB; the SKG and Forest Fire generation 
code, implemented in C++, is from [22]. For fair 
comparison, we do not time the printing or file saving 
parts of the SNAP software. Graph generation time 
for each model is listed in Table 2. Among all of the 
results, FD and FRD are the fastest, in that order. SKG 
is slower than both the FD and FRD models (by roughly 
4× to 14× in Table 2). Forest Fire is the slowest, even 
though C++ code is much faster than MATLAB code. 



Table 2: Graph generation time 

Graph Name       SKG     FD      FRD     FF
cit-HepPh        2.17s   0.16s   0.19s   18.80s
soc-Epinions1    1.53s   0.29s   0.41s   6.73s
web-NotreDame    4.95s   0.56s   0.62s   29.66s
soc-LiveJournal  6m51s   31.15s  41.75s  2h28m32s
youtube          -       2.16s   2.53s   2m22s
flickr           -       10.30s  12.20s  1h11m2s
livejournal      -       35.30s  59.98s  8h30m18s

We analyze the number of reciprocal edges generated 
by each model in Table 3. The FF model cannot 
generate any reciprocal edges. The FD model can 
generate a few random reciprocal edges, but this number 
is negligible compared to the real number of reciprocal 
edges. The SKG model generates some reciprocal 
edges; however, this is typically also far fewer than the 
real number. The FRD model performs the best and 
generates approximately the correct number of reciprocal 
edges. 

Table 3: Reciprocal Edges created by each model 

Graph Name       Orig.  SKG   FD   FRD   FF
cit-HepPh        1071   1160  159  1148  -
soc-Epinions1    31K    835   86   30K   -
web-NotreDame    89K    5K    27   85K   -
soc-LiveJournal  1.5M   14K   171  1.5M  -
youtube          526K   -     18   499K  -
flickr           1.3M   -     205  1.3M  -
livejournal      4.1M   -     258  4.0M  -

We also analyze the degree distributions generated 
by each model. The plots are log-binned for ease of 
readability. Figure 3 shows the results on the 
soc-Epinions1 graph. Here we see that all four methods do 
fairly well in terms of matching the total in- and 
out-degree distributions. (The few low values for SKG are 
due to its well-known cycling behavior [18].) However, 
only the FRD method matches the reciprocal degree 
distribution. The FD and SKG methods produce far too 
few reciprocal edges and FF does not produce any. We 
see very similar behavior in Figure 4 for soc-LiveJournal, 
except here the FF and SKG degree distributions do not 
match the total out-degree distribution very well. Once 
again, neither FD nor SKG produces many reciprocal 
edges and FF does not produce any. Figures for 
cit-HepPh and web-NotreDame are shown in the appendix. 

For larger graphs, we have not included SKG due 
to the expense of fitting the model. We do compare to 
FF, however, for the youtube and flickr graphs shown 
in Figure 5 and Figure 6, respectively. After extensive 
tuning, FF is able to match the total in- and out-degree 
distributions fairly well. But it of course cannot match 
the reciprocal degree. 

Figure 3: Comparisons of degree distributions produced by various models for graph soc-Epinions1. (Panels: Total In Degree, Total Out Degree, Reciprocal Degree; legend: True, FD, FRD, FF, SKG.) 

Figure 4: Comparisons of degree distributions produced by various models for graph soc-LiveJournal. 

Finally, we show results just for our methods on 
the largest graph, livejournal, in Figure 7. We observe 
a very close match for the FRD method in all three 
distributions. 

5 Significance and Impact 

Directed networks have not received much attention in 
terms of generative models. An obvious first-level goal 
for a generative model would be to match the total 
in- and out-degree distributions of a given graph. We 
propose the FD model for this purpose. It is a fast 
variant of the directed Chung-Lu model [1, 3, 4], picking 
endpoints for each edge at random, proportional to each 
node's desired degree. This is a close cousin of the 
configuration model [14]. It compares favorably in both 
speed and accuracy to existing state-of-the-art models. 

Directed social networks, however, cannot be mod- 
elled well without considering reciprocal edges. The pro- 
posed FD model generates very few reciprocal edges. In 
fact, few models explicitly generate such edges. The 
FF model, for example, does not even have the capabil- 
ity to generate a reciprocal edge since new nodes only 
connect to older nodes and there is no capacity for the 
older nodes to add additional links (i.e., back to the new 
node). Probably the most direct approach to this was 
the extension of preferential attachment in [20], but it 
was unable to capture both in- and out-degree distribu- 
tions. 

To also address the modeling of reciprocal edges, we 
propose an effective, if straightforward, FRD approach 
to modeling these networks by separately building models 
for the reciprocal edges (modeled as an undirected 
graph) and the non-reciprocal edges (modeled as a 
directed graph using FD). We combine the two graphs in 
an unsophisticated way: simply randomly mapping the 
nodes in the undirected graph to the set of all nodes. A 
more sophisticated approach would be to combine the 
reciprocal and non-reciprocal edges in a way that 
respected the total in- and out-degree distributions. To 
our surprise, this did not seem to be necessary, indicating 
that there is not a strong correlation between reciprocal 
and non-reciprocal edges in the graphs we studied. 

Figure 5: Comparisons of degree distributions produced by various models for graph youtube. 

Figure 6: Comparisons of degree distributions produced by various models for graph flickr. 

Compared to state-of-the-art directed graph mod- 
els such as FF [10] and SKG [8], both our FD and FRD 
methods are significantly faster and give a more accu- 
rate match to the various degree distributions. More- 
over, there is no fitting procedure to determine the pa- 
rameters of our FD and FRD methods. Even the best 
choice for the "blowup" parameter, if one wants to be 
very exact, can be determined from equations previously 
outlined. Of course, we would be remiss if we failed 
to point out that the FD and FRD models have many 
more parameters (i.e., the entire degree distributions) 
than FF and SKG, but this is still a very small amount 
of data compared to the overall sizes of the graphs. 

Our FD and FRD methods are highly scalable since 
edge generation can be done in parallel, on multiple 
threads or across multiple machines in a distributed 
setting. The data required to generate each edge is on the 
order of the size of the maximum degree, and so can be 
easily transmitted to multiple processors. 

These models also serve as a baseline for understanding 
networks. For instance, a community is often 
defined as a subgraph with more edges than expected as 
compared to a random model. Similarly, baseline models 
are generally useful in understanding the significance of 
observed patterns such as directed triangles or more 
sophisticated patterns. When analyzing social networks, 
explicit reciprocal edges are important since they are 
extremely common and clearly play a role in community 
structure and other observed phenomena. 

Figure 7: Comparisons of degree distributions produced by various models for graph liveJournal. 